Skip to content

[AURON #1724] Support binary input for Spark substring function#2262

Open
lyne7-sc wants to merge 7 commits into
apache:masterfrom
lyne7-sc:fix/spark_substr
Open

[AURON #1724] Support binary input for Spark substring function#2262
lyne7-sc wants to merge 7 commits into
apache:masterfrom
lyne7-sc:fix/spark_substr

Conversation

@lyne7-sc
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Closes #1724

Rationale for this change

Spark substring supports both string and binary inputs, while Auron previously mapped it to datafusion's Substr, which only handled string-compatible behavior and caused the Spark string/binary substring suite case to be excluded.

What changes are included in this PR?

  • Add native Spark_Substring ext function support for Utf8 and Binary inputs.
  • Route Spark Substring conversion through Spark_Substring and preserve the input data type.
  • Add native unit coverage for string and binary substring.
  • Re-enable string / binary substring function test case.

Are there any user-facing changes?

Yes. Spark SQL substring now supports binary input in native execution, matching Spark behavior.

How was this patch tested?

  • Added native unit tests for spark_substring.
  • Spark String Function Suite

Copy link
Copy Markdown
Contributor

@ShreyeshArangath ShreyeshArangath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM, left a few comments

Comment thread native-engine/datafusion-ext-functions/src/spark_strings.rs Outdated
Comment thread native-engine/datafusion-ext-functions/src/spark_strings.rs Outdated
Comment thread native-engine/datafusion-ext-functions/src/spark_strings.rs
Comment thread native-engine/datafusion-ext-functions/src/spark_strings.rs Outdated
@lyne7-sc
Copy link
Copy Markdown
Contributor Author

@ShreyeshArangath Thanks for your review! I’ve updated the implementation accordingly.

Copy link
Copy Markdown
Contributor

@ShreyeshArangath ShreyeshArangath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! LGTM, just one question

Copy link
Copy Markdown
Contributor

@yew1eb yew1eb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Rust implementation looks correct. One issue on the Scala side.

@@ -1025,10 +1025,10 @@ object NativeConverters extends Logging {
if pos.asInstanceOf[Int] > 0 && len.asInstanceOf[Int] >= 0 =>
val longPos = pos.asInstanceOf[Int].toLong
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The guard pos > 0 && len >= 0 was inherited from the old Substr mapping and is now too restrictive. spark_substring correctly handles pos == 0 (treated as 1) and negative pos (counted from the end), as the Rust unit tests already cover. Cases with pos <= 0 will still fall back to Spark instead of using the native path. Consider relaxing the guard to just len >= 0, or dropping it entirely since the implementation handles all integer values.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! That makes sense to me. I’ve relaxed the guard so these cases can go through the native path.

Copy link
Copy Markdown
Contributor

@ShreyeshArangath ShreyeshArangath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Failed] Substring does not support binary input

3 participants